ABSTRACT

While computer and communication technologies have provided effective means to scale up many aspects of education, the submission and grading of assessments such as homework assignments and tests remains a weak link. In this paper, we study the problem of automatically grading the kinds of open response mathematical questions that figure prominently in STEM (science, technology, engineering, and mathematics) courses. Our data-driven framework for mathematical language processing (MLP) leverages solution data from a large number of learners to evaluate the correctness of their solutions, assign partial-credit scores, and provide feedback to each learner on the likely locations of any errors. MLP takes inspiration from the success of natural language processing for text data and comprises three main steps. First, we convert each solution to an open response mathematical question into a series of numerical features. Second, we cluster the features from several solutions to uncover the structures of correct, partially correct, and incorrect solutions. We develop two different clustering approaches, one that leverages generic clustering algorithms and one based on Bayesian nonparametrics. Third, we automatically grade the remaining (potentially large number of) solutions based on their assigned cluster and one instructor-provided grade per cluster. As a bonus, we can track the cluster assignment of each step of a multistep solution and determine when it departs from a cluster of correct solutions, which enables us to indicate the likely locations of errors to learners. We test and validate MLP on real-world MOOC data to demonstrate how it can substantially reduce the human effort required in large-scale educational platforms.

Author Keywords 

Automatic grading, Machine learning, Clustering, Bayesian
nonparametrics, Assessment, Feedback, Mathematical
language processing



L@S’15, March 14-March 15, 2015, Vancouver, Canada. 

Copyright © 2015 ACM ISBN/15/03...$15.00. 


INTRODUCTION 

Large-scale educational platforms have the capability to revolutionize education by providing inexpensive, high-quality learning opportunities for millions of learners worldwide. Examples of such platforms include massive open online courses (MOOCs) [6, 7, 9, 10, 16, 42], intelligent tutoring systems [43], computer-based homework and testing systems [1, 31, 38, 40], and personalized learning systems [24]. While computer and communication technologies have provided effective means to scale up the number of learners viewing lectures (via streaming video), reading the textbook (via the web), interacting with simulations (via a graphical user interface), and engaging in discussions (via online forums), the submission and grading of assessments such as homework assignments and tests remains a weak link.

There is a pressing need to find new ways and means to automate two critical tasks that are typically handled by the instructor or course assistants in a small-scale course: (i) grading of assessments, including allotting partial credit for partially correct solutions, and (ii) providing individualized feedback to learners on the locations and types of their errors.

Substantial progress has been made on automated grading and feedback systems in several restricted domains, including essay evaluation using natural language processing (NLP) [1, 33], computer program evaluation [12, 15, 29, 32, 34], and mathematical proof verification [8, 19, 21].

In this paper, we study the problem of automatically grading the kinds of open response mathematical questions that figure prominently in STEM (science, technology, engineering, and mathematics) education. To the best of our knowledge, there exist no tools to automatically evaluate and allot partial-credit scores to the solutions of such questions. As a result, large-scale education platforms have resorted either to oversimplified multiple choice input and binary grading schemes (correct/incorrect), which are known to convey less information about the learners’ knowledge than open response questions [17], or peer-grading schemes [25, 26], which shift the burden of grading from the course instructor to the learners.^1

^1 While peer grading appears to have some pedagogical value for learners [30], each learner typically needs to grade several solutions from other learners for each question they solve, in order to obtain an accurate grade estimate.
Figure 1: Example solutions to the question “Find the derivative of $(x^3 + \sin x)/e^x$” that were assigned scores of 3, 2, 1 and 0 out of 3, respectively, by our MLP-B algorithm.

Figure 2: Examples of two different yet correct paths to solve the question “Simplify the expression $(x^2 + x + \sin^2 x + \cos^2 x)(2x - 3)$.”


Main Contributions 

In this paper, we develop a data-driven framework for mathematical language processing (MLP) that leverages solution data from a large number of learners to evaluate the correctness of solutions to open response mathematical questions, assign partial-credit scores, and provide feedback to each learner on the likely locations of any errors. The scope of our framework is broad and covers questions whose solution involves one or more mathematical expressions. This includes not just formal proofs but also the kinds of mathematical calculations that figure prominently in science and engineering courses. Examples of solutions to two algebra questions of various levels of correctness are given in Figures 1 and 2. In this regard, our work differs significantly from that of [8], which focuses exclusively on evaluating logical proofs.

Our MLP framework, which is inspired by the success of NLP methods for the analysis of textual solutions (e.g., essays and short answer), comprises three main steps.

First, we convert each solution to an open response mathematical question into a series of numerical features. In deriving these features, we make use of symbolic mathematics to transform mathematical expressions into a canonical form.

Second, we cluster the features from several solutions to uncover the structures of correct, partially correct, and incorrect solutions. We develop two different clustering approaches. MLP-S uses the numerical features to define a similarity score between pairs of solutions and then applies a generic clustering algorithm, such as spectral clustering (SC) [22] or affinity propagation (AP) [11]. We show that MLP-S is also useful for visualizing mathematical solutions. This can help instructors identify groups of learners that make similar errors so that instructors can deliver personalized remediation. MLP-B defines a nonparametric Bayesian model for the solutions and applies a Gibbs sampling algorithm to cluster the solutions.

Third, once a human assigns a grade to at least one solution in each cluster, we automatically grade the remaining (potentially large number of) solutions based on their assigned cluster. As a bonus, in MLP-B, we can track the cluster assignment of each step in a multistep solution and determine when it departs from a cluster of correct solutions, which enables us to indicate the likely locations of errors to learners.

In developing MLP, we tackle three main challenges of analyzing open response mathematical solutions. First, solutions might contain different notations that refer to the same mathematical quantity. For instance, in Figure 1, the learners use both $e^{-x}$ and $\frac{1}{e^x}$ to refer to the same quantity. Second, some questions admit more than one path to the correct/incorrect solution. For instance, in Figure 2 we see two different yet correct solutions to the same question. It is typically infeasible for an instructor to enumerate all of these possibilities to automate the grading and feedback process. Third, numerically verifying the correctness of the solutions does not always apply to mathematical questions, especially when simplifications are required. For example, a question that asks to simplify the expression $\sin^2 x + \cos^2 x + x$ can have both $1 + x$ and $\sin^2 x + \cos^2 x + x$ as numerically correct answers, since both these expressions output the same value for all values of $x$. However, the correct answer is $1 + x$, since the question expects the learners to recognize that $\sin^2 x + \cos^2 x = 1$. Thus, methods developed to check the correctness of computer programs and formulae by specifying a range of different inputs and checking for the correct outputs, e.g., [32], cannot always be applied to accurately grade open response mathematical questions.

Related Work 

Prior work has led to a number of methods for grading and providing feedback to the solutions of certain kinds of open response questions. A linear regression-based approach has been developed to grade essays using features extracted from a training corpus using Natural Language Processing (NLP) [1, 33]. Unfortunately, such a simple regression-based model does not perform well when applied to the features extracted from mathematical solutions. Several methods have been developed for automated analysis of computer programs [15, 32]. However, these methods do not apply to the solutions to open response mathematical questions, since they lack the structure and compilability of computer programs. Several methods have also been developed to check the correctness of the logic in mathematical proofs [8, 19, 21]. However, these methods apply only to mathematical proofs involving logical operations and not the kinds of open-ended mathematical calculations that are often involved in science and engineering courses.

The idea of clustering solutions to open response questions into groups of similar solutions has been used in a number of previous endeavors: [2, 5] uses clustering to grade short, textual answers to simple questions; [23] uses clustering to visualize a large collection of computer programs; and [28] uses clustering to grade and provide feedback on computer programs. Although the high-level concept underlying these works is resonant with the MLP framework, the feature building techniques used in MLP are very different, since the structure of mathematical solutions differs significantly from short textual answers and computer programs.

This paper is organized as follows. In the next section, we develop our approach to convert open response mathematical solutions to numerical features that can be processed by machine learning algorithms. We then develop MLP-S and MLP-B and use real-world MOOC data to showcase their ability to accurately grade a large number of solutions based on the instructor’s grades for only a small number of solutions, thus substantially reducing the human effort required in large-scale educational platforms. We close with a discussion and perspectives on future research directions.


MLP FEATURE EXTRACTION 

The first step in our MLP framework is to transform a collection of solutions to an open response mathematical question into a set of numerical features. In later sections, we show how the numerical features can be used to cluster and grade solutions as well as generate informative learner feedback.

A solution to an open response mathematical question will in general contain a mixture of explanatory text and core mathematical expressions. Since the correctness of a solution depends primarily on the mathematical expressions, we will ignore the text when deriving features. However, we recognize that the text is potentially very useful for automatically generating explanations for various mathematical expressions. We leave this avenue for future work.

A workhorse of NLP is the bag-of-words model; it has found tremendous success in text semantic analysis. This model treats a text document as a collection of words and uses the frequencies of the words as numerical features to perform tasks like topic classification and document clustering [4, 5].

A solution to an open response mathematical question consists of a series of mathematical expressions that are chained together by text, punctuation, or mathematical delimiters including $=$, $<$, $>$, $\propto$, etc. For example, the solution in Figure 1(b) contains the expressions $((x^3 + \sin x)/e^x)'$, $((3x^2 + \cos x)e^x - (x^3 + \sin x)e^x)/e^{2x}$, and $(2x^2 - x^3 + \cos x - \sin x)/e^x$ that are all separated by the delimiter “=”.


MLP identifies the unique mathematical expressions contained in the learners’ solutions and uses them as features, effectively extending the bag-of-words model to use mathematical expressions as features rather than words. To coin a phrase, MLP uses a novel bag-of-expressions model.

Once the mathematical expressions have been extracted from a solution, we parse them using SymPy, the open source Python library for symbolic mathematics [36].^2 SymPy has powerful capability for simplifying expressions. For example, $x^2 + x^2$ can be simplified to $2x^2$, and $e^x x^2 / e^{2x}$ can be simplified to $e^{-x} x^2$. In this way, we can identify the equivalent terms in expressions that refer to the same mathematical quantity, resulting in more accurate features. In practice for some questions, however, it might be necessary to tone down the level of SymPy’s simplification. For instance, the key to solving the question in Figure 2 is to simplify the expression using the Pythagorean identity $\sin^2 x + \cos^2 x = 1$. If SymPy is called on to perform such a simplification automatically, then it will not be possible to verify whether a learner has correctly navigated the simplification in their solution. For such problems, it is advisable to perform only arithmetic simplifications.
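The two simplification levels discussed above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the paper: the helper name `canonicalize` and its `full_simplify` flag are hypothetical, and it assumes that plain parsing (which applies only SymPy's automatic arithmetic evaluation) is an acceptable stand-in for "arithmetic simplifications only".

```python
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr


def canonicalize(expr_text, full_simplify=True):
    """Parse a learner's expression and return a canonical SymPy object.

    Hypothetical helper: with full_simplify=False only the automatic
    arithmetic evaluation performed during parsing is applied.
    """
    expr = parse_expr(expr_text)
    return simplify(expr) if full_simplify else expr


# Two notations for the same quantity map to the same canonical form.
same_quantity = canonicalize("exp(-x)") == canonicalize("1/exp(x)")

# Arithmetic-only mode leaves the Pythagorean identity unapplied, so a learner
# who stops at sin^2 x + cos^2 x + x stays distinguishable from one who
# reaches 1 + x.
unsimplified = canonicalize("sin(x)**2 + cos(x)**2 + x", full_simplify=False)
simplified = canonicalize("sin(x)**2 + cos(x)**2 + x", full_simplify=True)
target = canonicalize("1 + x")
```

Comparing `unsimplified` and `simplified` against `target` shows why full simplification would hide whether the learner carried out the key step themselves.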

After extracting the expressions from the solutions, we transform the expressions into numerical features. We assume that $N$ learners submit solutions to a particular mathematical question. Extracting the expressions from each solution using SymPy yields a total of $V$ unique expressions across the $N$ solutions.

We encode the solutions in an integer-valued solution feature matrix $\mathbf{Y} \in \mathbb{N}^{V \times N}$ whose rows correspond to different expressions and whose columns correspond to different solutions; that is, the $(i, j)$th entry of $\mathbf{Y}$ is given by

$$Y_{i,j} = \text{number of times expression } i \text{ appears in solution } j.$$

Each column of $\mathbf{Y}$ corresponds to a numerical representation of a mathematical solution. Note that we do not consider the ordering of the expressions in this model; such an extension is an interesting avenue for future work. In this paper, we indicate in $\mathbf{Y}$ only the presence and not the frequency of an expression, i.e., $\mathbf{Y} \in \{0, 1\}^{V \times N}$ and

$$Y_{i,j} = \begin{cases} 1 & \text{if expression } i \text{ appears in solution } j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

The extension to encoding frequencies is straightforward.

To illustrate how the matrix $\mathbf{Y}$ is constructed, consider the solutions in Figure 2(a) and (b). Across both solutions, there are 7 unique expressions. Thus, $\mathbf{Y}$ is a $7 \times 2$ matrix, with each row corresponding to a unique expression. Letting the first four rows of $\mathbf{Y}$ correspond to the four expressions in Figure 2(a) and the remaining three rows to expressions 2-4 in Figure 2(b), we have

$$\mathbf{Y} = \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}^{T}.$$
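As a rough sketch of the bag-of-expressions encoding, the binary matrix can be built from solutions given as lists of canonicalized expressions. The function name and the string labels below are hypothetical placeholders; the example assumes, as the matrix in the text implies, that the second solution shares its first and last expressions with the first.

```python
def build_feature_matrix(solutions):
    """Return (expressions, Y) with Y[i][j] = 1 if expression i appears in solution j."""
    expressions = []
    for sol in solutions:
        for expr in sol:
            if expr not in expressions:
                expressions.append(expr)  # keep first-seen order of unique expressions
    Y = [[1 if expr in sol else 0 for sol in solutions] for expr in expressions]
    return expressions, Y


# Placeholder labels standing in for canonicalized expressions (assumed data).
solution_a = ["q", "a2", "a3", "final"]
solution_b = ["q", "b2", "b3", "b4", "final"]
expressions, Y = build_feature_matrix([solution_a, solution_b])
```

Each column of `Y` is one solution; storing counts instead of 0/1 indicators would give the frequency-based variant mentioned in the text.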


^2 In particular, we use the parse_expr function.



We end this section with the crucial observation that, for a wide range of mathematical questions, many expressions will be shared across learners’ solutions. This is true, for instance, in Figure 2. This suggests that there are a limited number of types of solutions to a question (both correct and incorrect) and that solutions of the same type tend to be similar to each other. This leads us to the conclusion that the $N$ solutions to a particular question can be effectively clustered into $K \ll N$ clusters. In the next two sections, we will develop MLP-S and MLP-B, two algorithms to cluster solutions according to their numerical features.

MLP-S: SIMILARITY-BASED CLUSTERING 

In this section, we outline MLP-S, which clusters and then 
grades solutions using a solution similarity-based approach. 

The MLP-S Model 

We start by using the solution features in $\mathbf{Y}$ to define a notion of similarity between pairs of solutions. Define the $N \times N$ similarity matrix $\mathbf{S}$ containing the pairwise similarities between all solutions, with its $(i, j)$th entry the similarity between solutions $i$ and $j$

$$S_{i,j} = \frac{\mathbf{y}_i^T \mathbf{y}_j}{\min\{\mathbf{y}_i^T \mathbf{y}_i,\ \mathbf{y}_j^T \mathbf{y}_j\}}. \qquad (2)$$

The column vector $\mathbf{y}_i$ denotes the $i$th column of $\mathbf{Y}$ and corresponds to learner $i$’s solution. Informally, $S_{i,j}$ is the number of common expressions between solution $i$ and solution $j$ divided by the minimum of the number of expressions in solutions $i$ and $j$. A large/small value of $S_{i,j}$ corresponds to the two solutions being similar/dissimilar. For example, the similarity between the solutions in Figure 1(a) and Figure 1(b) is 1/3 and the similarity between the solutions in Figure 2(a) and Figure 2(b) is 1/2. $\mathbf{S}$ is symmetric, and $0 \le S_{i,j} \le 1$. Equation (2) is just one of many possible solution similarity metrics. We defer the development of other metrics to future work.
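The similarity in (2) is simple to compute from a binary feature matrix. The sketch below is illustrative only: the function name is hypothetical, the matrix is stored as a list of rows, and returning 0 for an empty solution is an assumption made here to avoid division by zero.

```python
def similarity_matrix(Y):
    """Pairwise similarity of (2): shared expressions over the smaller solution size."""
    V, N = len(Y), len(Y[0])
    cols = [[Y[i][j] for i in range(V)] for j in range(N)]  # one column per solution
    S = [[0.0] * N for _ in range(N)]
    for a in range(N):
        for b in range(N):
            shared = sum(u * v for u, v in zip(cols[a], cols[b]))
            smaller = min(sum(u * u for u in cols[a]), sum(v * v for v in cols[b]))
            S[a][b] = shared / smaller if smaller else 0.0  # assumption: empty -> 0
    return S


# 7 x 2 matrix in which the two solutions share two expressions.
Y = [[1, 1], [1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]]
S = similarity_matrix(Y)
```

Here the two solutions contain 4 and 5 expressions and share 2, so the off-diagonal similarity is 2/4.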


Clustering Solutions in MLP-S

Having defined the similarity $S_{i,j}$ between two solutions $i$ and $j$, we now cluster the $N$ solutions into $K \ll N$ clusters such that the solutions within each cluster have high similarity scores between them and solutions in different clusters have low similarity scores between them.

Given the similarity matrix $\mathbf{S}$, we can use any of the multitude of standard clustering algorithms to cluster solutions. Two examples of clustering algorithms are spectral clustering (SC) [22] and affinity propagation (AP) [11]. The SC algorithm requires specifying the number of clusters $K$ as an input parameter, while the AP algorithm does not.

Figure 3 illustrates how AP is able to identify clusters of similar solutions from solutions to four different mathematical questions. The figures on the top correspond to solutions to the questions in Figures 1 and 2, respectively. The bottom two figures correspond to solutions to two signal processing questions. Each node in the figure corresponds to a solution, and nodes with the same color correspond to solutions that belong to the same cluster. For each figure, we show a sample solution from some of these clusters, with the boxed solutions corresponding to correct solutions. We can make three interesting observations from Figure 3:

• In the top left figure, we cluster a solution with the final answer $(3x^2 + \cos x - (x^3 + \sin x))/e^x$ with a solution with the final answer $(3^{x^2} + \cos x - (x^3 + \sin x))/e^x$. Although the latter solution is incorrect, it contained a typographical error where 3*x^2 was typed as 3^x^2. MLP-S is able to identify this typographical error, since the expression before the final solution is contained in several other correct solutions.

• In the top right figure, the correct solution requires identifying the trigonometric identity $\sin^2 x + \cos^2 x = 1$. The clustering algorithm is able to identify a subset of the learners who were not able to identify this relationship and hence could not simplify their final expression.

• MLP-S is able to identify solutions that are strongly connected to each other. Such a visualization can be extremely useful for course instructors. For example, an instructor can easily identify a group of learners who lack mastery of a certain skill that results in a common error and adjust their course plan accordingly to help these learners.

Auto-Grading via MLP-S

Having clustered all solutions into a small number $K$ of clusters, we assign the same grade to all solutions in the same cluster. If a course instructor assigns a grade to one solution from each cluster, then MLP-S can automatically grade the remaining $N - K$ solutions. We construct the index set $\mathcal{I}_S$ of solutions that the course instructor needs to grade as

$$\mathcal{I}_S = \left\{ \arg\max_{i \in \mathcal{C}_k} \sum_{j=1}^{N} S_{i,j},\ k = 1, 2, \ldots, K \right\},$$

where $\mathcal{C}_k$ represents the index set of the solutions in cluster $k$. In words, in each cluster, we select the solution having the highest similarity to the other solutions (ties are broken randomly) to include in $\mathcal{I}_S$. We demonstrate the performance of auto-grading via MLP-S in the experimental results section below.
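The selection of one representative per cluster, followed by grade propagation, might look like the sketch below. The function name, the toy similarity matrix, the cluster labels, and the instructor grades are all invented for illustration, and ties are resolved by keeping the first candidate rather than randomly as in the text.

```python
def select_representatives(S, labels):
    """For each cluster, pick the member with the largest total similarity to all solutions."""
    best = {}
    for i, k in enumerate(labels):
        score = sum(S[i])  # sum over j of S[i][j]
        if k not in best or score > best[k][0]:  # assumption: first candidate wins ties
            best[k] = (score, i)
    return {k: i for k, (score, i) in best.items()}


# Toy data (assumed): four solutions, the first three in cluster 0.
S = [
    [1.0, 0.8, 0.2, 0.0],
    [0.8, 1.0, 0.6, 0.1],
    [0.2, 0.6, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]
labels = [0, 0, 0, 1]
reps = select_representatives(S, labels)

# The instructor grades only the representatives; every other solution
# inherits the grade assigned to its cluster.
instructor_grades = {0: 3, 1: 0}
grades = [instructor_grades[k] for k in labels]
```

With this data the instructor grades two solutions and the remaining two are graded automatically.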

MLP-B: BAYESIAN NONPARAMETRIC CLUSTERING 

In this section, we outline MLP-B, which clusters and then grades solutions using a Bayesian nonparametrics-based approach. The MLP-B model and algorithm can be interpreted as an extension of the model in [44], where a similar approach is proposed to cluster short text documents.

The MLP-B Model 

Following the key observation that the $N$ solutions can be effectively clustered into $K \ll N$ clusters, let $\mathbf{z}$ be the $N \times 1$ cluster assignment vector, with $z_j \in \{1, \ldots, K\}$ denoting the cluster assignment of the $j$th solution with $j \in \{1, \ldots, N\}$. Using this latent variable, we model the probability of all learners’ solutions to the question as

$$p(\mathbf{Y}) = \prod_{j=1}^{N} p(\mathbf{y}_j) = \prod_{j=1}^{N} \left( \sum_{k=1}^{K} p(\mathbf{y}_j \mid z_j = k)\, p(z_j = k) \right),$$

where $\mathbf{y}_j$, the $j$th column of the data matrix $\mathbf{Y}$, corresponds to learner $j$’s solution to the question. Here we have implicitly assumed that the learners’ solutions are independent of each other. By analogy to topic models [4, 35], we assume that learner $j$’s solution to the question, $\mathbf{y}_j$, is generated according to a multinomial distribution given the cluster assignments $\mathbf{z}$ as

$$p(\mathbf{y}_j \mid z_j = k) = \mathrm{Mult}(\mathbf{y}_j \mid \phi_k) = \frac{\left(\sum_{i=1}^{V} Y_{i,j}\right)!}{Y_{1,j}!\, Y_{2,j}! \cdots Y_{V,j}!}\, \Phi_{1,k}^{Y_{1,j}} \Phi_{2,k}^{Y_{2,j}} \cdots \Phi_{V,k}^{Y_{V,j}}, \qquad (3)$$

where $\Phi \in [0, 1]^{V \times K}$ is a parameter matrix with $\Phi_{v,k}$ denoting its $(v, k)$th entry. $\phi_k \in [0, 1]^{V \times 1}$ denotes the $k$th column of $\Phi$ and characterizes the multinomial distribution over all the $V$ features for cluster $k$.

Figure 3: Illustration of the clusters obtained by MLP-S by applying affinity propagation (AP) on the similarity matrix $\mathbf{S}$ corresponding to learners’ solutions to four different mathematical questions (see Table 1 for more details about the datasets and the Appendix for the question statements). Each node corresponds to a solution. Nodes with the same color correspond to solutions that are estimated to be in the same cluster. The thickness of the edge between two solutions is proportional to their similarity score. Boxed solutions are correct; all others are in varying degrees of correctness.

In practice, one often has no information regarding the number of clusters $K$. Therefore, we consider $K$ as an unknown parameter and infer it from the solution data. In order to do so, we impose a Chinese restaurant process (CRP) prior on the cluster assignments $\mathbf{z}$, parameterized by a parameter $\alpha$. The CRP characterizes the random partition of data into clusters, in analogy to the seating process of customers in a Chinese restaurant. It is widely used in the Bayesian mixture modeling literature [3, 14]. Under the CRP prior, the cluster (table) assignment of the $j$th solution (customer), conditioned on the cluster assignments of all the other solutions, follows the distribution

$$p(z_j = k \mid \mathbf{z}_{\neg j}, \alpha) = \begin{cases} \dfrac{n_{k,\neg j}}{N - 1 + \alpha} & \text{if cluster } k \text{ is occupied,} \\[2ex] \dfrac{\alpha}{N - 1 + \alpha} & \text{if cluster } k \text{ is empty,} \end{cases} \qquad (4)$$

where $n_{k,\neg j}$ represents the number of solutions that belong to cluster $k$ excluding the current solution $j$, with $\sum_{k=1}^{K} n_{k,\neg j} = N - 1$. The vector $\mathbf{z}_{\neg j}$ represents the cluster assignments of the other solutions. The flexibility of allowing any solution to start a new cluster of its own enables us to automatically infer $K$ from data. It is known [37] that the expected number of clusters under the CRP prior satisfies $K \sim O(\alpha \log N) \ll N$, so our method scales well as the number of learners $N$ grows large. We also impose a Gamma prior $\alpha \sim \mathrm{Gam}(\alpha_\alpha, \alpha_\beta)$ on $\alpha$ to help us infer its value.

Figure 4: Graphical model of the generation process of solutions to mathematical questions. $\alpha_\alpha$, $\alpha_\beta$, and $\beta$ are hyperparameters, $\mathbf{z}$ and $\Phi$ are latent variables to be inferred, and $\mathbf{Y}$ is the observed data defined in (1).
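To make the CRP conditional in (4) concrete, the following sketch computes the assignment probabilities for one solution from the occupancy counts of the existing clusters. The function name and the example counts are illustrative assumptions.

```python
def crp_conditional(counts, alpha):
    """Probabilities of (4); counts[k] is the size of cluster k without the current solution."""
    total = sum(counts) + alpha  # equals N - 1 + alpha
    probs = [n / total for n in counts]  # occupied clusters
    probs.append(alpha / total)  # last entry: probability of opening a new cluster
    return probs


# Nine other solutions spread over three clusters, with alpha = 1 (assumed values).
probs = crp_conditional([5, 3, 1], alpha=1.0)
```

Larger clusters attract new members more strongly, while the final entry keeps a nonzero chance of starting a new cluster, which is what lets the number of clusters be inferred from data.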

Since the solution feature data $\mathbf{Y}$ is assumed to follow a multinomial distribution parameterized by $\Phi$, we impose a symmetric Dirichlet prior over $\Phi$ as $\phi_k \sim \mathrm{Dir}(\phi_k \mid \beta)$ because of its conjugacy with the multinomial distribution [13].

The graphical model representation of our model is visualized in Figure 4. Our goal next is to estimate the cluster assignments $\mathbf{z}$ for the solution of each learner, the parameters $\phi_k$ of each cluster, and the number of clusters $K$, from the binary-valued solution feature data matrix $\mathbf{Y}$.


Clustering Solutions in MLP-B 

We use a Gibbs sampling algorithm for posterior inference under the MLP-B model, which automatically groups solutions into clusters. We start by applying a generic clustering algorithm (e.g., $K$-means, with $K = N/10$) to initialize $\mathbf{z}$, and then initialize $\Phi$ accordingly. Then, in each iteration of MLP-B, we perform the following steps:


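1. Sample $\mathbf{z}$: For each solution $j$, we remove it from its current cluster and sample its cluster assignment $z_j$ from the posterior $p(z_j = k \mid \mathbf{z}_{\neg j}, \alpha, \mathbf{Y})$. Using Bayes rule, we have

$$p(z_j = k \mid \mathbf{z}_{\neg j}, \phi_k, \alpha, \mathbf{Y}) = p(z_j = k \mid \mathbf{z}_{\neg j}, \phi_k, \alpha, \mathbf{y}_j) \propto p(z_j = k \mid \mathbf{z}_{\neg j}, \alpha)\, p(\mathbf{y}_j \mid z_j = k, \phi_k).$$

The prior probability $p(z_j = k \mid \mathbf{z}_{\neg j}, \alpha)$ is given by (4). For non-empty clusters, the observed data likelihood $p(\mathbf{y}_j \mid z_j = k, \phi_k)$ is given by (3). However, this does not apply to new clusters that are previously empty. For a new cluster, we marginalize out $\phi_k$, resulting in

$$p(\mathbf{y}_j \mid z_j = k, \beta) = \int p(\mathbf{y}_j \mid z_j = k, \phi_k)\, p(\phi_k \mid \beta)\, d\phi_k = \int \mathrm{Mult}(\mathbf{y}_j \mid z_j = k, \phi_k)\, \mathrm{Dir}(\phi_k \mid \beta)\, d\phi_k$$

$$= \frac{\left(\sum_{i=1}^{V} Y_{i,j}\right)!}{\prod_{i=1}^{V} Y_{i,j}!} \cdot \frac{\Gamma(V\beta)}{\Gamma\left(\sum_{i=1}^{V} Y_{i,j} + V\beta\right)} \prod_{i=1}^{V} \frac{\Gamma(Y_{i,j} + \beta)}{\Gamma(\beta)},$$

where $\Gamma(\cdot)$ is the Gamma function and the leading factor is the multinomial coefficient from (3).

If a cluster becomes empty after we remove a solution from its current cluster, then we remove it from our sampling process and erase its corresponding multinomial parameter vector $\phi_k$. If a new cluster is sampled for $z_j$, then we sample its multinomial parameter vector $\phi_k$ immediately according to Step 2 below. Otherwise, we do not change $\phi_k$ until we have finished sampling $\mathbf{z}$ for all solutions.

2. Sample $\Phi$: For each cluster $k$, sample $\phi_k$ from its posterior $\mathrm{Dir}(\phi_k \mid n_{1,k} + \beta, \ldots, n_{V,k} + \beta)$, where $n_{i,k}$ is the number of times feature $i$ occurs in the solutions that belong to cluster $k$.

3. Sample $\alpha$: Sample $\alpha$ using the approach described in [41].

4. Update $\beta$: Update $\beta$ using the fixed-point procedure described in [20].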
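The likelihood of a solution under a brand-new cluster in Step 1 is a Dirichlet-multinomial marginal, which is best evaluated in log space. The sketch below is an illustration under the assumption of a symmetric Dirichlet prior with parameter `beta`; the function name is hypothetical, and the multinomial coefficient is included so that the value is comparable with the multinomial likelihood used for occupied clusters.

```python
from math import exp, lgamma


def log_new_cluster_likelihood(y, beta):
    """Log of p(y | new cluster, beta) with phi integrated out under Dir(beta)."""
    V, n = len(y), sum(y)
    # Multinomial coefficient n! / prod(y_i!), in log space.
    log_coeff = lgamma(n + 1) - sum(lgamma(c + 1) for c in y)
    # Ratio of Gamma functions from the Dirichlet normalizers.
    log_ratio = lgamma(V * beta) - lgamma(n + V * beta)
    log_ratio += sum(lgamma(c + beta) - lgamma(beta) for c in y)
    return log_coeff + log_ratio


# Two features, one of them present, uniform prior (beta = 1).
p_single = exp(log_new_cluster_likelihood([1, 0], beta=1.0))
p_both = exp(log_new_cluster_likelihood([1, 1], beta=1.0))
```

With two features and a uniform prior, a solution containing only the first feature has marginal probability 1/2, and one containing both has probability 1/3, which matches the standard Dirichlet-multinomial result.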

The output of the Gibbs sampler is a series of samples that correspond to the approximate posterior distribution of the various parameters of interest. To make meaningful inference for these parameters (such as the posterior mean of a parameter), it is important to appropriately post-process these samples. For our estimate of the true number of clusters, $\hat{K}$, we simply take the mode of the posterior distribution on the number of clusters $K$. We use only iterations with $K = \hat{K}$ to estimate the posterior statistics [39].
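The post-processing rule for the number of clusters can be sketched in a few lines. The function name and the sample trace below are invented for illustration; the input is assumed to be the number of occupied clusters recorded at each post-burn-in iteration.

```python
from collections import Counter


def select_k_hat(k_samples):
    """Return the posterior mode of K and the indices of iterations where K equals it."""
    k_hat = Counter(k_samples).most_common(1)[0][0]
    keep = [t for t, k in enumerate(k_samples) if k == k_hat]
    return k_hat, keep


# Assumed trace of the number of clusters over six retained iterations.
k_hat, keep = select_k_hat([9, 10, 10, 11, 10, 9])
```

Only the iterations listed in `keep` would then contribute to the posterior statistics.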

In mixture models, the issue of “label-switching” can cause a model to be unidentifiable, because the cluster labels can be arbitrarily permuted without affecting the data likelihood. In order to overcome this issue, we use an approach reported in [39]. First, we compute the likelihood of the observed data in each iteration as $p(\mathbf{Y} \mid \Phi^{\ell}, \mathbf{z}^{\ell})$, where $\Phi^{\ell}$ and $\mathbf{z}^{\ell}$ represent the samples of these variables at the $\ell$th iteration. After the algorithm terminates, we search for the iteration $\ell_{\max}$ with the largest data likelihood and then permute the labels $\mathbf{z}^{\ell}$ in the other iterations to best match $\Phi^{\ell}$ with $\Phi^{\ell_{\max}}$. We use $\hat{\Phi}$ (with columns $\hat{\phi}_k$) to denote the estimate of $\Phi$, which is simply the posterior mean of $\Phi$. Each solution $j$ is assigned to the cluster indexed by the mode of the samples from the posterior of $z_j$, denoted by $\hat{z}_j$.

Auto-Grading via MLP-B 

We now detail how to use MLP-B to automatically grade a large number $N$ of learners’ solutions to a mathematical question, using a small number $\hat{K}$ of instructor graded solutions. First, as in MLP-S, we select the set $\mathcal{I}_B$ of “typical solutions” for the instructor to grade. We construct $\mathcal{I}_B$ by selecting one solution from each of the $\hat{K}$ clusters that is most representative of the solutions in that cluster:

$$\mathcal{I}_B = \left\{ \arg\max_{j} p(\mathbf{y}_j \mid \hat{\phi}_k),\ k = 1, 2, \ldots, \hat{K} \right\}.$$

In words, for each cluster, we select the solution with the largest likelihood of being in that cluster.

The instructor grades the $\hat{K}$ solutions in $\mathcal{I}_B$ to form the set of instructor grades $\{g_k\}$ for $k \in \mathcal{I}_B$. Using these grades, we assign grades to the other solutions $j \notin \mathcal{I}_B$ according to

$$\hat{g}_j = \frac{\sum_{k=1}^{\hat{K}} p(\mathbf{y}_j \mid \hat{\phi}_k)\, g_k}{\sum_{k=1}^{\hat{K}} p(\mathbf{y}_j \mid \hat{\phi}_k)}.$$

That is, we grade each solution not in $\mathcal{I}_B$ as the average of the instructor grades weighted by the likelihood that the solution belongs to each cluster. We demonstrate the performance of auto-grading via MLP-B in the experimental results section below.
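The likelihood-weighted grade is a one-line computation once the per-cluster likelihoods of a solution are available. In the sketch below the function name, the likelihood values, and the grades are made up for illustration.

```python
def expected_grade(likelihoods, grades):
    """Average of the per-cluster instructor grades, weighted by cluster likelihoods."""
    total = sum(likelihoods)
    return sum(p * g for p, g in zip(likelihoods, grades)) / total


# Assumed likelihoods of one solution under three clusters graded 3, 2 and 0.
g_hat = expected_grade([0.6, 0.3, 0.1], [3, 2, 0])
```

Because the weights are normalized inside the function, unnormalized likelihoods can be passed in directly.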


Table 1: Datasets consisting of the solutions of 116 learners to 4 mathematical questions on algebra and signal processing. See the Appendix for the question statements.

             No. of solutions N    No. of features (unique expressions) V
Question 1   108                   78
Question 2   113                   53
Question 3   90                    100
Question 4   110                   45


Feedback Generation via MLP-B 

In addition to grading solutions, MLP-B can automatically 
provide useful feedback to learners on where they made errors 
in their solutions. 

For a particular solution $j$ denoted by its column feature value vector $\mathbf{y}_j$ with $V_j$ total expressions, let $\mathbf{y}_j^{(v)}$ denote the feature value vector that corresponds to the first $v$ expressions of this solution, with $v \in \{1, 2, \ldots, V_j\}$. Under this notation, we evaluate the probability that the first $v$ expressions of solution $j$ belong to each of the $\hat{K}$ clusters: $p(\mathbf{y}_j^{(v)} \mid \hat{\phi}_k)$, $k \in \{1, 2, \ldots, \hat{K}\}$, for all $v$. Using these probabilities, we can also compute the expected credit of solution $j$ after the first $v$ expressions via

$$\hat{g}_j^{(v)} = \frac{\sum_{k=1}^{\hat{K}} p(\mathbf{y}_j^{(v)} \mid \hat{\phi}_k)\, g_k}{\sum_{k=1}^{\hat{K}} p(\mathbf{y}_j^{(v)} \mid \hat{\phi}_k)},$$

where $\{g_k\}$ is the set of instructor grades as defined above.

Using these quantities, it is possible to identify that the learner has likely made an error at the $v$th expression if it is most likely to belong to a cluster with credit $g_k$ less than the full credit or, alternatively, if the expected credit $\hat{g}_j^{(v)}$ is less than the full credit.
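Tracking the expected credit step by step can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, features are binary, the cluster parameters and grades are invented, and every feature that appears is assumed to have nonzero probability in every cluster so that the logarithm is defined.

```python
from math import exp, lgamma, log


def log_multinomial(y, phi):
    """Log multinomial likelihood of feature vector y under cluster parameters phi."""
    n = sum(y)
    out = lgamma(n + 1) - sum(lgamma(c + 1) for c in y)
    out += sum(c * log(p) for c, p in zip(y, phi) if c > 0)
    return out


def credit_trace(order, phis, grades, V):
    """Expected credit after each prefix; order lists feature indices as the learner wrote them."""
    y = [0] * V
    trace = []
    for idx in order:
        y[idx] = 1  # reveal the next expression of the solution
        w = [exp(log_multinomial(y, phi)) for phi in phis]
        trace.append(sum(wi * g for wi, g in zip(w, grades)) / sum(w))
    return trace


# Two assumed clusters over three features, graded 3 and 1.
phis = [[0.5, 0.4, 0.1], [0.5, 0.1, 0.4]]
trace = credit_trace(order=[0, 2], phis=phis, grades=[3, 1], V=3)
```

In this toy case the expected credit drops from 2.0 to 1.4 at the second expression, which would flag that step as the likely location of the error.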

The ability to automatically locate where an error has been made in a particular incorrect solution provides many benefits. For instance, MLP-B can inform instructors of the most common locations of learner errors to help guide their instruction. It can also enable an automated tutoring system to generate feedback to a learner as they make an error in the early steps of a solution, before it propagates to later steps. We demonstrate the efficacy of MLP-B to automatically locate learner errors using real-world educational data in the experiments section below.

EXPERIMENTS 

In this section, we demonstrate how MLP-S and MLP-B can 
be used to accurately estimate the grades of roughly 100 open 
response solutions to mathematical questions by only asking 
the course instructor to grade approximately 10 solutions. We 
also demonstrate how MLP-B can be used to automatically 
provide feedback to learners on the locations of errors in their 
solutions. 

Auto-Grading via MLP-S and MLP-B 
Datasets 

Our dataset consists of 116 learners solving 4 open response
mathematical questions in an edX course. The set of questions
includes 2 high-school level mathematical questions and 2
college-level signal processing questions (details about the
questions can be found in Table 1, and the question statements
are given in the Appendix). For each question, we pre-process
the solutions to filter out the blank solutions and extract
features. Using the features, we represent the solutions by the
matrix $\mathbf{Y}$ in (1). Every solution was graded by the
course instructor with one of the scores in the set
$\{0, 1, 2, 3\}$, with a full credit of 3.


Baseline: Random sub-sampling 

We compare the auto-grading performance of MLP-S and
MLP-B against a baseline method that does not group the
solutions into clusters. In this method, we randomly sub-sample
all solutions to form a small set of solutions for the instructor
to grade. Then, each ungraded solution is simply assigned the
grade of the solution in the set of instructor-graded solutions
that is most similar to it as defined by S in (2). Since this
small set is picked randomly, we run the baseline method 10
times and report the best performance.^1
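
The baseline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `similarity` stands in for the matrix S in (2), which is taken as given, and the function names `pick_random_subset` and `nearest_graded_baseline` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def pick_random_subset(num_solutions: int, num_graded: int, seed: int) -> List[int]:
    """Randomly choose which solutions the instructor grades by hand."""
    return random.Random(seed).sample(range(num_solutions), num_graded)


def nearest_graded_baseline(
    similarity: Sequence[Sequence[float]],
    graded: Dict[int, int],
) -> Dict[int, int]:
    """Give every ungraded solution the grade of its most similar graded one.

    `similarity[j][i]` is the similarity between solutions j and i, and
    `graded` maps instructor-graded solution indices to their grades.
    """
    if not graded:
        raise ValueError("the instructor must grade at least one solution")
    estimates: Dict[int, int] = {}
    for j in range(len(similarity)):
        if j in graded:
            continue
        nearest = max(graded, key=lambda i: similarity[j][i])
        estimates[j] = graded[nearest]
    return estimates


# Hypothetical example: solutions 0-1 resemble each other, as do 2-3.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
estimates = nearest_graded_baseline(S, graded={0: 3, 2: 1})
```

Repeating the procedure with ten different seeds for `pick_random_subset` and keeping the best run would mirror the protocol described above.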

Experimental setup 

For each question, we apply four different methods for
auto-grading:

• Random sub-sampling (RS) with the number of clusters
$K \in \{5, 6, \ldots, 40\}$.

• MLP-S with spectral clustering (SC) with
$K \in \{5, 6, \ldots, 40\}$.

• MLP-S with affinity propagation (AP) clustering. This
algorithm does not require K as an input.

• MLP-B with hyperparameters set to non-informative values
(all equal to 1), running the Gibbs sampling algorithm for
10,000 iterations with 2,000 burn-in iterations.

MLP-S with AP and MLP-B both automatically estimate the
number of clusters K. Once the clusters are selected, we
assign one solution from each cluster to be graded by the
instructor using the methods described in earlier sections.


^1 Other baseline methods, such as the linear regression-based method
used in the edX essay grading system [33], are not listed, because
they did not perform as well as random sub-sampling in our
experiments.

Performance metric

We use mean absolute error (MAE), which measures the "average
absolute error per auto-graded solution",

$$\mathrm{MAE} = \frac{\sum_{j=1}^{N-K} |\hat{g}_j - g_j|}{N - K},$$

as our performance metric. Here, $N - K$ equals the number
of solutions that are auto-graded, and $\hat{g}_j$ and $g_j$
represent the estimated grade (for MLP-B, the estimated grades
are rounded to integers) and the actual instructor grade for
the auto-graded solutions, respectively.
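
As a small illustration of this metric, the following Python sketch computes the MAE over the auto-graded solutions only, excluding the K solutions graded by the instructor. The function name and the example grades are hypothetical, and estimates are assumed to be already rounded where applicable.

```python
from typing import Sequence, Set


def mean_absolute_error(
    estimated: Sequence[int],
    actual: Sequence[int],
    instructor_graded: Set[int],
) -> float:
    """MAE over the N - K auto-graded solutions.

    `instructor_graded` holds the indices of the K solutions the instructor
    graded by hand; they are excluded from the average.
    """
    auto = [j for j in range(len(actual)) if j not in instructor_graded]
    if not auto:
        raise ValueError("no auto-graded solutions to evaluate")
    return sum(abs(estimated[j] - actual[j]) for j in auto) / len(auto)


# Hypothetical example: N = 5 solutions, K = 2 graded by the instructor.
estimated = [3, 3, 2, 0, 1]
actual = [3, 2, 2, 0, 3]
mae = mean_absolute_error(estimated, actual, instructor_graded={0, 3})
```

Here the three auto-graded solutions have absolute errors of 1, 0 and 2, so the MAE is 1.0.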

Results and discussion 

In Figure 5, we plot the MAE versus the number of clusters
K for Questions 1-4. MLP-S with SC consistently outperforms
the random sub-sampling baseline for almost all values of K.
This performance gain is likely due to the fact that the
baseline method does not cluster the solutions and thus does
not select a good subset of solutions for the instructor to
grade. MLP-B is more accurate than MLP-S with both SC and AP
and can automatically estimate the value of K, although at the
price of significantly higher computational complexity (e.g.,
clustering and auto-grading one question takes 2 minutes for
MLP-B compared to only 5 seconds for MLP-S with AP on a
standard laptop computer with a 2.8 GHz CPU and 8 GB memory).




[Figure 5 panels: (a) Question 1; (b) Question 2; (c) Question 3; (d) Question 4.]


Both MLP-S and MLP-B grade the learners' solutions accurately
(e.g., an MAE of 0.04 out of the full grade 3 using only
K = 13 instructor grades to auto-grade all N = 113 solutions
to Question 2). Moreover, as we see in Figure 5, the
MAE for MLP-S decreases as K increases, and eventually
reaches 0 when K is large enough that only solutions that are
exactly the same as each other belong to the same cluster. In
practice, one can tune the value of K to achieve a balance
between maximizing grading accuracy and minimizing human
effort. Such a tuning process is not necessary for MLP-B,
since it automatically estimates the value of K and achieves
such a balance.

Feedback Generation via MLP-B 
Experimental setup 

Since Questions 3-4 require some familiarity with signal
processing, we demonstrate the efficacy of MLP-B in providing
feedback on mathematical solutions on Questions 1-2.
Among the solutions to each question, there are a few types
of common errors that more than one learner makes. We take
one incorrect solution out of each type and run MLP-B on the
other solutions to estimate the parameter $\boldsymbol{\phi}_k$
for each cluster. Using this information and the instructor
grades $\{g_k\}$, after each expression $v$ in a solution, we
compute the probability $p(\mathbf{y}_j^{(v)} \mid \boldsymbol{\phi}_k)$
that it belongs to a cluster that does not have full credit
($g_k < 3$), together with the expected credit using (6). Once
the expected grade is calculated to be less than full credit,
we consider that an error has occurred.


Figure 5: Mean absolute error (MAE) versus the number of
instructor graded solutions (clusters) K, for Questions 1-4,
respectively. For example, on Question 1, MLP-S and
MLP-B estimate the true grade of each solution with an average
error of around 0.1 out of a full credit of 3. "RS" represents
the random sub-sampling baseline. Both MLP-S methods
and MLP-B outperform the baseline method.


Results and discussion

Two sample feedback generation processes are shown in
Figure 6. In Figure 6(a), we can provide feedback to the learner
on their error as early as Line 2, before it carries over to later
lines. Thus, MLP-B can potentially become a powerful tool
to generate timely feedback to learners as they are solving
mathematical questions, by analyzing the solutions it gathers
from other learners.

CONCLUSIONS

We have developed a framework for mathematical language
processing (MLP) that consists of three main steps: (i) converting
each solution to an open response mathematical question
into a series of numerical features; (ii) clustering the features
from several solutions to uncover the structures of correct,
partially correct, and incorrect solutions; and (iii) automatically
grading the remaining (potentially large number of)
solutions based on their assigned cluster and one
instructor-provided grade per cluster. As our experiments have
indicated, our framework can substantially reduce the human
effort required for grading in large-scale courses. As a bonus,
MLP-S enables instructors to visualize the clusters of solutions
to help them identify common errors and thus groups of
learners having the same misconceptions. As a further bonus,
MLP-B can track the cluster assignment of each step of a
multistep solution and determine when it departs from a cluster
of correct solutions, which enables us to indicate the locations
of errors to learners in real time. Improved learning outcomes
should result from these innovations.


There are several avenues for continued research. We are
currently planning more extensive experiments on the edX
platform involving tens of thousands of learners. We are also
planning to extend the feature extraction step to take into
account both the ordering of expressions and ancillary text in
a solution. Clustering algorithms that allow a solution to
belong to more than one cluster could make MLP more robust
to outlier solutions and further reduce the number of solutions
that the instructors need to grade. Finally, it would be
interesting to explore how the features of solutions could be used
to build predictive models, as in the Rasch model [27] or item
response theory [18].

$((x^3 + \sin x)/e^x)'$
$= ((x^3 + \sin x)'\, e^x - (x^3 + \sin x)(e^x)')/e^{2x}$
    prob. incorrect = 0.11, exp. grade = 3
$= (e^x (2x^2 + \cos x) - (x^3 + \sin x)\, e^x)/e^{2x}$
    prob. incorrect = 0.66, exp. grade = 2
$= (2x^2 + \cos x - x^3 - \sin x)/e^x$
    prob. incorrect = 0.93, exp. grade = 2
$= (x^2 (2 - x) + \cos x - \sin x)/e^x$
    prob. incorrect = 0.99, exp. grade = 2

(a) A sample feedback generation process where the learner makes an
error in the expression in Line 2 while attempting to solve Question 1.

$(x^2 + x + \sin^2 x + \cos^2 x)(2x - 3)$
$= (x^2 + x + 1)(2x - 3)$
    prob. incorrect = 0.09, exp. grade = 3
$= 4x^3 + 2x^2 + 2x - 3x^2 - 3x - 3$
    prob. incorrect = 0.82, exp. grade = 2
$= 4x^3 - x^2 - x - 3$
    prob. incorrect = 0.99, exp. grade = 2

(b) A sample feedback generation process where the learner makes an
error in the expression in Line 3 while attempting to solve Question 2.

Figure 6: Demonstration of real-time feedback generation by
MLP-B while learners enter their solutions. After each
expression, we compute both the probability that the learner's
solution belongs to a cluster that does not have full credit and
the learner's expected grade. An alert is generated when the
expected credit is less than full credit.


APPENDIX: QUESTION STATEMENTS 

Question 1: Multiply

$$(x^2 + x + \sin^2 x + \cos^2 x)(2x - 3)$$

and simplify your answer as much as possible.

Question 2: Find the derivative of

$$\frac{x^3 + \sin(x)}{e^x}$$

and simplify your answer as much as possible.
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